Skip to content

Refactor filterByTainted#27584

Open
Juanadelacuesta wants to merge 5 commits intomainfrom
NOJIRA-filters
Open

Refactor filterByTainted#27584
Juanadelacuesta wants to merge 5 commits intomainfrom
NOJIRA-filters

Conversation

@Juanadelacuesta
Copy link
Copy Markdown
Member

@Juanadelacuesta Juanadelacuesta commented Feb 24, 2026

Description

This PR attempts to refactor de function filterByTainted to make it clearer and easier to expand. It maintains all its functionality and adds testing.
The approach is to have a priority organised classification rules of the shape condition func(allocContext) bool that will determine in with bucket should each allocation end up.
The allocContext contains all the information necessary to classify an alloc:

type allocContext struct {
	alloc           *structs.Allocation
	shouldReconnect bool
	taintedNode     *structs.Node
	nodeIsTainted   bool
	now             time.Time
}

Testing & Reproduction steps

Links

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade
    guide
    . If you would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@mismithhisler
Copy link
Copy Markdown
Member

Everytime I see filterByTainted I always have to think about what it's actually doing. I think this PR does a good job of "categorizing" them, and so I wonder if it's also worth renaming the function categorizeNodes or something similar?

@Juanadelacuesta
Copy link
Copy Markdown
Member Author

why Nodes?

@Juanadelacuesta Juanadelacuesta changed the title style: refactor filterByTainted Refactor filterByTainted May 1, 2026
@Juanadelacuesta Juanadelacuesta added backport/ent/1.11.x+ent backport to 1.11.x+ent release line backport/2.0.x backport to 2.0.x release line backport/ent/2.0.x+ent backport to 2.0.x+ent release line labels May 4, 2026
Copy link
Copy Markdown
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fabulous. The code itself looks solid to me. My comments are all about adding even more clarity based on the new structure.

Comment on lines +125 to +126
condition func(allocContext) bool
category allocCategory
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suuuper nit-picky, but reversing the order of these both a) is alphabetical, and b) puts the simpler information (category) first. the latter I think makes the already way more readable classification rule slice just that extra little bit more so.

Suggested change
condition func(allocContext) bool
category allocCategory
category allocCategory
condition func(allocContext) bool

Comment on lines +133 to +138
condition: func(ctx allocContext) bool {
return ctx.shouldReconnect &&
ctx.alloc.DesiredStatus == structs.AllocDesiredStatusRun &&
ctx.alloc.ClientStatus == structs.AllocClientStatusFailed
},
category: categoryReconnecting,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. per my previous comment

Suggested change
condition: func(ctx allocContext) bool {
return ctx.shouldReconnect &&
ctx.alloc.DesiredStatus == structs.AllocDesiredStatusRun &&
ctx.alloc.ClientStatus == structs.AllocClientStatusFailed
},
category: categoryReconnecting,
category: categoryReconnecting,
condition: func(ctx allocContext) bool {
return ctx.shouldReconnect &&
ctx.alloc.DesiredStatus == structs.AllocDesiredStatusRun &&
ctx.alloc.ClientStatus == structs.AllocClientStatusFailed
},

category is always just the one line, so my brain latches on to it a little easier before the variable-length condition.

(I repeat: this is super nit-picky 😋)


// classificationRules are evaluated in order; first match wins.
var classificationRules = []classificationRule{
// Failed allocs that need reconnect and are still desired to run.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Failed allocs that need reconnect and are still desired to run.
// Failed allocs that need to reconnect and are still desired to run.

},
category: categoryExpiring,
},
// Failed reconnects that server already marked to stop should be ignored.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Failed reconnects that server already marked to stop should be ignored.
// Failed reconnects that the server already marked to stop should be ignored.

I assume this refers to the nomad server/leader?

Comment on lines +178 to +185
// Disconnected node and alloc already unknown.
{
condition: func(ctx allocContext) bool {
return ctx.taintedNode != nil &&
ctx.taintedNode.Status == structs.NodeStatusDisconnected &&
ctx.alloc.ClientStatus == structs.AllocClientStatusUnknown
},
category: categoryUntainted,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bonus from these conditions being so well-contained is that it's easier for me to isolate where wording confuses me.

This is saying "Sure, the node is tainted (can't handle allocs), but the alloc has the appropriate status, so it's fine -- we don't need to do anything with it." Right? I.e. the alloc already in the state that we want/expect it to be.

Short of renaming untainted, maybe the comment could elaborate on this a bit?

Suggested change
// Disconnected node and alloc already unknown.
{
condition: func(ctx allocContext) bool {
return ctx.taintedNode != nil &&
ctx.taintedNode.Status == structs.NodeStatusDisconnected &&
ctx.alloc.ClientStatus == structs.AllocClientStatusUnknown
},
category: categoryUntainted,
// Disconnected node, but alloc already marked unknown, so no changes needed.
{
category: categoryUntainted,
condition: func(ctx allocContext) bool {
return ctx.taintedNode != nil &&
ctx.taintedNode.Status == structs.NodeStatusDisconnected &&
ctx.alloc.ClientStatus == structs.AllocClientStatusUnknown
},

or something like that?

alloc: func() *structs.Allocation {
a := makeAlloc("c3", "ready", structs.AllocClientStatusComplete, structs.AllocDesiredStatusRun)
a.DeploymentStatus = &structs.AllocDeploymentStatus{Canary: true}
a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainly to avoid issues after #27875 that removes pointer.Of

Suggested change
a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)}
a.DesiredTransition = structs.DesiredTransition{Migrate: new(true)}

name: "migrate flag",
alloc: func() *structs.Allocation {
a := makeAlloc("c11", "ready", structs.AllocClientStatusPending, structs.AllocDesiredStatusRun)
a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a.DesiredTransition = structs.DesiredTransition{Migrate: pointer.Of(true)}
a.DesiredTransition = structs.DesiredTransition{Migrate: new(true)}

testCases := []struct {
name string
alloc *structs.Allocation
nodeMap map[string]*structs.Node
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this field; every test case passes in nodes for this value.

Suggested change
nodeMap map[string]*structs.Node

for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
all := allocSet{tc.alloc.ID: tc.alloc}
state := ClusterState{Now: now, TaintedNodes: tc.nodeMap}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to other comment,

Suggested change
state := ClusterState{Now: now, TaintedNodes: tc.nodeMap}
state := ClusterState{Now: now, TaintedNodes: nodes}

all := allocSet{tc.alloc.ID: tc.alloc}
state := ClusterState{Now: now, TaintedNodes: tc.nodeMap}

untainted, migrate, lost, disconnecting, reconnecting, ignore, expiring := all.filterByTainted(state)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll put this comment here so github makes it a thread to discuss, only tangentially related to this line.

I agree with @mismithhisler that we could rename this func, because filterByTainted is not very intuitive, but you're right that it's not Nodes that are being categorized. How about categorizeAllocs?

I also think the doc comment could be a little clearer, while we're at it, something like

// categorizeAllocs takes a set of tainted nodes and filters the allocation set
// into the following groups:
// 1. untainted: allocs on untainted nodes (AND clarify what exactly this means...)
// 2. migrate: allocs on nodes that are draining and need to be moved
// 3. lost: allocs on lost nodes or have expired and need to be replaced
// 4. disconnect: allocs on nodes that are disconnected, but have not had their ClientState set to unknown
// 5. reconecting: allocs on a node that has reconnected, and need ......
// 6. ignore: allocs in a state that results in a noop that only need to be counted
// 7. expiring: allocs on disconnected nodes and need to be marked lost (and possibly replaced)
func (set allocSet) categorizeAllocs(state ClusterState)
(untainted, migrate, lost, disconnecting, reconnecting, ignore, expiring allocSet) {

?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and what do we think about backports? The prospect of merge conflicts always sets me on edge, but previous refactors (e.g. #26042 and #26324) were not backported... I assume that's why you targeted 1.11.x+ent and 2.0.x? (PS shouldn't need 2.0.x+ent, because CE->Ent merge will catch that).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/ent/1.11.x+ent backport to 1.11.x+ent release line backport/ent/2.0.x+ent backport to 2.0.x+ent release line backport/2.0.x backport to 2.0.x release line

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants